Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries

نویسندگان

  • Ebrahim Ansari
  • Mohammad Hadi Sadreddini
  • Lucio Grandinetti
  • Mehdi Sheikhalishahi
چکیده

Ebrahim Ansari ([email protected]) et al. 2017. Extracting bilingual per-sian italian lexicon from comparable corpora using different types of seed dictionaries. In " Applications of Comparable Corpora " edited book Berlin Linguistic Press (ed.). Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lexicons from non-parallel (comparable) corpora have been proposed. Almost all use a small existing dictionary or other resource to make an initial list called the " seed dictionary ". In this paper we discuss the use of different types of dictionaries as the initial starting list for creating a bilingual Persian-Italian lexicon from a comparable corpus. Our experiments apply state-of-the-art techniques on three different seed dictionaries ; an existing dictionary, a dictionary created with pivot-based schema, and a dictionary extracted from a small Persian-Italian parallel text. The interesting challenge of our approach is to find a way to combine different dictionaries together in order to produce a better and more accurate lexicon. In order to combine seed dictionaries, we propose two different combination models and examine the effect of our novel combination models on various comparable corpora that have differing degrees of comparability. We conclude with a proposal for a new weighting system to improve the extracted lexicon. The experimental results produced by our implementation show the efficiency of our proposed models.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Adapted Seed Lexicon and Combined Bidirectional Similarity Measures for Translation Equivalent Extraction from Comparable Corpora

An improved method for extracting translation equivalents from bilingual comparable corpora according to contextual similarity was developed. This method has two main features. First, a seed bilingual lexicon—which is used to bridge contexts in different languages—is adapted to the corpora from which translation equivalents are to be extracted. Second, the contextual similarity is evaluated by ...

متن کامل

Utilizing Contextually Relevant Terms in Bilingual Lexicon Extraction

This paper demonstrates one efficient technique in extracting bilingual word pairs from non-parallel but comparable corpora. Instead of using the common approach of taking high frequency words to build up the initial bilingual lexicon, we show contextually relevant terms that co-occur with cognate pairs can be efficiently utilized to build a bilingual dictionary. The result shows that our model...

متن کامل

A Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora

In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different ...

متن کامل

Assisting Translators in Indirect Lexical Transfer

We present the design and evaluation of a translator’s amenuensis that uses comparable corpora to propose and rank nonliteral solutions to the translation of expressions from the general lexicon. Using distributional similarity and bilingual dictionaries, the method outperforms established techniques for extracting translation equivalents from parallel corpora. The interface to the system is av...

متن کامل

Automatic Generation of Bilingual Dictionaries Using Intermediary Languages and Comparable Corpora

This paper outlines a strategy to build new bilingual dictionaries from existing resources. The method is based on two main tasks: first, a new set of bilingual correspondences is generated from two available bilingual dictionaries. Second, the generated correspondences are validated by making use of a bilingual lexicon automatically extracted from non-parallel, and comparable corpora. The qual...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1701.08340  شماره 

صفحات  -

تاریخ انتشار 2017